Working with big data

Session 5: Interactive and dynamic graphics

Professor Di Cook

Department of Econometrics and Business Statistics

Outline

  • Foundations of interactivity
  • How to make interactive graphics
  • Graphical user interfaces (GUI)
  • Interactive tables
  • Animation
  • Institutional workflow
  • Wrapping up

Why interactive plots

Reasons

  • Engage the reader and allow some choice in what to examine about the data or model.
  • De-clutter the information presented, by only showing some aspects of the data on-demand.
  • Too much information to present in a single plot, so provide multiple plots where the information is linked.
  • Re-scale information to change the focus on-demand.
Code
gp <- gapminder |> 
  filter (year == 2007) |>
  ggplot(aes(x=lifeExp, 
             y=gdpPercap,
             label=country,
             colour=continent)) +
  geom_point() +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  scale_y_log10("gdpPercap ('000)",
                breaks = seq(0, 50000, 10000), 
                labels = seq(0, 50, 10)) +
  theme(axis.title = element_text(family="Helvetica"),
        axis.text = element_text(family="Helvetica"),
        legend.title = element_text(family="Helvetica"),
        legend.text = element_text(family="Helvetica")) 
gp + geom_text() +
  ggtitle("Too cluttered")

Considerations

Interactivity means the user can directly change aspects of the plot using mouse or keyboard controls.

Animation is an alternative to interactivity that keeps control with the developer rather than the reader.


Keep in mind

  • Interactivity like selection should be precise.
  • Response needs to be fast.
  • Be careful not to inflate plot file size when including in reports or presentations.

Different types of interactivity

Types

  • mouse-over: information shows as cursor moves over plot. This is the simplest and most commonly implemented. Information is transient.
  • selection: items in plot selected using a click, rectangular or lasso area. Tends to be persistent and remains visible when cursor moves.
  • pan/zoom: change the area of focus, to be smaller or larger. A more sophisticated system might allow logical pan/zoom to particular subsets.
  • linking: multiple plots visible, selection on one provokes a reaction in the others also. A more sophisticated system allows selection using categorical variables, one-to-many or many-to-one linking. Allows relationships between many variables to be undeerstood.
  • graphical user interface (GUI): indirect control, but precise using menus, slides, checkboxes, text entry.

Mouse-over

Code
ggplotly(gp, width=700, height=550) |>
  config(displayModeBar = FALSE)

Notice also the subsetting legend.

Mouse-over is very easy to find in many, many software.

Pan/zoom

Code
ggplotly(gp, width=700, height=550) |>
  config(
         modeBarButtonsToRemove = c('select', 'zoomIn',
                                    'zoomOut', 'autoScale',
                                    'resetScale'))

Life expectancy in 2007 from the gapminder data.

Code
# Filter gapminder data for a specific year (e.g., 2007)
gapminder_2007 <- gapminder |>
  filter(year == 2007) |>
  select(country, lifeExp)

# Get world map data
world_sf <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf") |>
  select(name, geometry)

# Ensure the CRS is WGS 84 (Leaflet's requirement)
world_sf <- st_transform(world_sf, crs = 4326)

# Handle potential country name mismatches
world_sf <- world_sf |>
  mutate(name = recode(name, 
                       "United States" = "United States of America",
                       "Congo" = "Congo, Rep.",
                       "Democratic Republic of Congo" = "Congo, Dem. Rep.",
                       "Slovakia" = "Slovak Republic",
                       "Myanmar" = "Burma",
                       "Egypt" = "Egypt, Arab Rep.",
                       "Yemen" = "Yemen, Rep."))

# Merge the datasets
map_data <- world_sf |>
  left_join(gapminder_2007, by = c("name" = "country")) |>
  # Remove countries with no life expectancy data in the chosen year
  filter(!is.na(lifeExp))

# Define a color palette
pal <- colorNumeric("RdYlGn", domain = map_data$lifeExp)

# Create the map
leaflet(data = map_data) |>
  addProviderTiles(providers$CartoDB.Positron) |> # Add a base map tile
  addPolygons(
    fillColor = ~pal(lifeExp),
    weight = 1,
    opacity = 1,
    color = "white",
    dashArray = "3",
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 3,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.9,
      bringToFront = TRUE),
    popup = ~paste0("<strong>", name, "</strong><br/>Life Expectancy: ", round(lifeExp, 2), " years") # Add interactive popups
  ) |>
  addLegend(pal = pal, values = ~lifeExp, opacity = 0.7, title = "Life Expectancy (Years)",
            position = "bottomright")

Selection

Code
set.seed(802)
gapminder_ts <- gapminder |>
  as_tsibble(index=year, 
             key=c(country, continent)) |>
  sample_n_keys(50)
gphk <- highlight_key(gapminder_ts, ~country)

gpl <- ggplot(gphk, aes(x=year, 
                             y=lifeExp, 
                             group=country)) +
        geom_line() +
        ylab("Life Expectancy") +
        ggtitle("click on a line to highlight a country") +
        theme(axis.title = element_text(family="Helvetica"),
          axis.text = element_text(family="Helvetica"),
          legend.title = element_text(family="Helvetica"),
          legend.text = element_text(family="Helvetica"),
          title = element_text(family="Helvetica"))

ggpl <- ggplotly(gpl, height = 700, width = 1200) |>
  config(displayModeBar = FALSE)
        
highlight(ggpl)

Linking multiple plots

Code
gggs <- ggplot(gphk, aes(x=continent, y=country)) +
  geom_point() +
  xlab("") + ylab("") +
  ggtitle("rectangular selection") +
  theme(axis.title = element_text(family="Helvetica"),
      axis.text = element_text(family="Helvetica"),
      legend.title = element_text(family="Helvetica"),
      legend.text = element_text(family="Helvetica"),
      title = element_text(family="Helvetica")) 

gggspl <- ggplotly(gggs, width=500, height=500) |>
  config(displayModeBar = FALSE) |>
  highlight(on = "plotly_selected", 
            off = "plotly_doubleclick") 

ggpl2 <- ggplotly(gpl, height = 500, width = 800) |>
  config(displayModeBar = FALSE) |>
  highlight(on = "plotly_selected", 
            off = "plotly_doubleclick") 
  
bscols(gggspl, highlight(ggpl2), widths = c(5, 7))

Graphical user interface (GUI) control

Code
gpl <- ggplot(gphk, aes(x=year, 
                             y=lifeExp, 
                             group=country)) +
        geom_line() +
        ylab("Life Expectancy") + xlab("") +
        theme(axis.title = element_text(family="Helvetica"),
          axis.text = element_text(family="Helvetica"),
          legend.title = element_text(family="Helvetica"),
          legend.text = element_text(family="Helvetica"),
          title = element_text(family="Helvetica"))
ggpl2 <- ggplotly(gpl, height = 500, width = 1000) |>
  config(displayModeBar = FALSE)
  
bscols(widths = c(4, 7),
    filter_select("country", "country", gphk, ~country, multiple = TRUE),
  ggpl2)

GUI elements

GUIs provide explicit control over a small range of interactions.

  • Menu: for a medium number of categories
  • Slider: numeric values or range
  • Checkbox: for a small number of categories

are available in crosstalk.


Using sound

Vowel explorer example

Available software

Software list

More examples with plotly

The plotly package in R has two interfaces:

  • plot specification via plotly
  • translating ggplot2 plots and adding interactive elements

and creates interactive plots with javascript.

penguins_std <- penguins |>
  rename(bl = bill_len,
         bd = bill_dep,
         fl = flipper_len,
         bm = body_mass) |>
  select(species, bl:bm) |>
  na.omit()
  
plot_ly(data = penguins_std, 
        x = ~fl, 
        y = ~bl, 
        color = ~species, 
        size = 3, 
        width=600, height=450)

ggplot2 and plotly

gg <- ggplot(data=penguins_std, aes(x = fl, 
                                    y = bl, 
                                    colour = species)) +  
  geom_point(alpha=0.5) + 
  geom_smooth(method = "lm", se=F)
ggplotly(gg, width=600, height=490)

Maps

data(canada.cities, package = "maps")
viz <- ggplot(canada.cities, aes(long, lat)) +
  borders(regions = "canada") +
  coord_equal() +
  geom_point(aes(text = name, size = log2(pop)), 
             colour = "red", alpha = 1/4) +
  theme_map()
ggplotly(viz)

Not all ggplot2 geoms are supported in plotly, but when they are, they just work out of the box

Modifying plotly

plotly uses elements of crosstalk to provide additional interactivity, such as linked highlighting

txh_shared <- highlight_key(txhousing, ~year)

p <- ggplot(txh_shared, aes(month, median)) +
   geom_line(aes(group = year)) + 
   geom_smooth(data = txhousing, method = "gam") + 
   scale_x_continuous("", breaks=seq(1, 12, 1),
        labels=c("J", "F", "M", "A", "M", "J", 
                 "J", "A", "S", "O", "N", "D")) +
   scale_y_continuous("Median price ('00,000)", 
                      breaks = seq(0,300000,100000),
                      labels = seq(0,3,1)) +
   facet_wrap(~ city)

gg <- ggplotly(p, height = 800, width = 1100) |>
   plotly::layout(title = "Click on a line to highlight a year")

highlight(gg)

The power of crosstalk

Code
tourism_shared <- tourism |>
  as_shared_tsibble(spec = (State / Region) * Purpose)

tourism_feat <- tourism_shared |>
  features(Trips, feat_stl)

p1 <- tourism_shared |>
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(aes(group = Region), alpha = 0.5) +
  facet_wrap(~ Purpose, scales = "free_y") +
  theme(axis.title = element_text(family="Helvetica"),
        axis.text = element_text(family="Helvetica"),
        legend.title = element_text(family="Helvetica"),
        legend.text = element_text(family="Helvetica"))
p2 <- tourism_feat |>
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point(aes(group = Region)) +
  xlab("trend") + ylab("season") +
  theme(axis.title = element_text(family="Helvetica"),
        axis.text = element_text(family="Helvetica"),
        legend.title = element_text(family="Helvetica"),
        legend.text = element_text(family="Helvetica"),
        plot.title = element_text(family="Helvetica"))
subplot(
    ggplotly(p1, tooltip = "Region", width = 1200, height = 600) |>
  config(displayModeBar = FALSE),
    ggplotly(p2, tooltip = "Region", width = 1000, height = 500) |>
  config(displayModeBar = FALSE),
    nrows = 1, widths=c(0.5, 0.5), heights=1) |>
  highlight(dynamic = FALSE)

Tourism records across Australian regions.

The shared data objects from crosstalk make linking between plots easier!

Case study: mapping COVID in Victoria (1/4)

Constructing a choropleth

Code
# Read the data
# Replace null with 0, for three LGAs
# Convert to long form to join with polygons
# Make the date variables a proper date
# Set NAs to 0, this is a reasonable assumption
covid <- read_csv("data/melb_lga_covid.csv") |>
  mutate(Buloke = as.numeric(ifelse(Buloke == "null", "0", Buloke))) |>
   mutate(Hindmarsh = as.numeric(ifelse(Hindmarsh == "null", "0", Hindmarsh))) |>
   mutate(Towong = as.numeric(ifelse(Towong == "null", "0", Towong))) |>
  pivot_longer(cols = Alpine:Yarriambiack, names_to="NAME", values_to="cases") |>
  mutate(Date = ydm(paste0("2020/",Date))) |>
  mutate(cases=replace_na(cases, 0))
Code
# Case counts are cumulative, keep only latest
covid <- covid |>
  filter(Date == ymd("2020-10-20"))
Code
load("data/lga.rda")

covid_tot <- covid |>
  left_join(lga, by=c("NAME" = "lga")) |>
  st_as_sf()

# Make choropleth map, with appropriate colour palette
cm1 <- ggplot(covid_tot) + 
  geom_sf(aes(fill = cases, label = NAME),
    colour="grey80") + 
  scale_fill_distiller("Cases", 
    palette = "PuBuGn",
    direction=1) + 
  theme_map() +
  theme(legend.position="bottom")
cm1

Case study: mapping COVID in Victoria (2/4)

A cartogram expands a geographic are relative to the population in the area.



See more on cartograms here.

A better solution for Australia is needed, though.

Code
pop <- read_xlsx("data/VIF2019_Population_Service_Ages_LGA_2036.xlsx", sheet=3, skip=13, col_names = FALSE) |>
  select(`...4`, `...22`) |>
  rename(lga = `...4`, pop=`...22`) |>
  filter(lga != "Unincorporated Vic") |> 
  mutate(lga = str_replace(lga, " \\(.+\\)", "")) |>
  mutate(lga = ifelse(lga == "Colac-Otway", "Colac Otway", lga)) 

covid_tot <- covid_tot |>
  left_join(pop, by=c("NAME" = "lga")) 

covid_tot <- covid_tot |>
  mutate(cases_per10k = cases/pop*10000,
         lcases = log10(cases + 1)) 

covid_tot_carto <- covid_tot |> 
  st_transform(3395) |> 
  cartogram_cont("pop") |>
  st_transform("WGS84")   
  
covid_tot_carto <- st_cast(covid_tot_carto, "MULTIPOLYGON") 

cm2 <- ggplot(covid_tot_carto) + 
  geom_sf(aes(fill = cases, label=NAME),
    colour="grey80") + 
  scale_fill_distiller("Cases", palette = "PuBuGn",
                       direction=1) + 
  theme_map() +
  theme(legend.position="bottom")  
cm2 

Case study: mapping COVID in Victoria (3/4)

Code
# Placement of hexmaps depends on position relative to
# Melbourne central
data(capital_cities)
covid_hexmap <- create_hexmap(
  shp = covid_tot,
  sf_id = "NAME",
  focal_points = capital_cities, verbose = TRUE)

# Hexagons are made with the `fortify_hexagon` function
covid_hexmap_poly <- covid_hexmap |>
  fortify_hexagon(sf_id = "NAME", hex_size = 0.1869) |>
  left_join(covid_tot, by="NAME") # hexmap code removed cases!
cm3 <- ggplot() +
  geom_sf(data=covid_tot, 
          fill = "white", colour = "grey80", size=0.1) +
  geom_polygon(data=covid_hexmap_poly, 
               aes(x=long, y=lat, group=hex_id, 
                   fill = cases, 
                   colour = cases,
                   label=NAME), size=0.2) +
  scale_fill_distiller("Cases", palette = "PuBuGn",
                       direction=1) +
  scale_colour_distiller("Cases", palette = "PuBuGn",
                       direction=1) +
  theme_map() +
  theme(legend.position="bottom")
cm3


Learn more about hexagon tiling that works better for Australia here.

Case study: mapping COVID in Victoria (4/4)

Adding interaction

Code
cm1 <- cm1 + theme(legend.position = "none")
ggplotly(cm1, width=800, height=600) |>
  config(displayModeBar = FALSE)
Code
cm2 <- cm2 + theme(legend.position = "none")
ggplotly(cm2, width=800, height=600) |>
  config(displayModeBar = FALSE)
Code
cm3 <- cm3 + theme(legend.position = "none")
ggplotly(cm3, width=800, height=600) |>
  config(displayModeBar = FALSE)

Tables

Resources

See latest Table Contest for the most recent innovations in table contruction.

Animation

How and why

  • The package gganimate (Lin-Pederson) allows one to make and save animations, based on building up from ggplot2.
  • Animations are different from interactive graphics in that the viewer does not have any control.
  • Useful for different important stages of a visualisation (e.g. time) and to keep track of how different visualisations are related
  • Often good for talks and visual explanations

Try examples at gganimate website and Mitch, Ursula and Nick’s NUMBAT tutorial.

Institutional workflow

Example: United States Department of Agriculture

Three apps, to communicate information about soybean breeding in the USA:

Continuously deployed for more than a decade!

Three sources of data:

  • Next-generation sequencing DNA-seq on 79 lines: DNA sequencing libraries were prepared using TruSeq DNA sample prep and NuGENs unamplified prep kits (Illumina Inc., San Diego, CA and NuGEN Technologies Inc., San Carlos, CA).
  • Field yield trials: 30/79 + 138 ancestral lines
  • Breeding literature, what lines were bred to produce what line

Process

  1. Connection to database
  2. Reproducible reports
  3. Deployment

Within organisation, maintain

  • Github repos for versioning content and sharing locally developed packages, templates for document styling, eg monash templates, teaching
  • (optional) local CRAN-like package server
  • shiny server for app delivery, eg Monash EBS (shown in session )

Resources